DocPuzzle: A Process-Aware Benchmark for Evaluating Realistic Long-Context Reasoning Capabilities
Zhuang, Tianyi, Kuang, Chuqiao, Li, Xiaoguang, Teng, Yihua, Wu, Jihao, Wang, Yasheng, Shang, Lifeng
We present DocPuzzle, a rigorously constructed benchmark for evaluating long-context reasoning capabilities in large language models (LLMs). The benchmark comprises 100 expert-level QA problems requiring multi-step reasoning over long real-world documents. To ensure task quality and complexity, we implement a human-AI collaborative annotation-validation pipeline. DocPuzzle introduces an innovative evaluation framework that mitigates guessing bias through checklist-guided process analysis, establishing new standards for assessing reasoning capacities in LLMs. Our evaluation results show that: 1) advanced slow-thinking reasoning models such as o1-preview (69.7%) and DeepSeek-R1 (66.3%) significantly outperform the best general instruct models such as Claude 3.5 Sonnet (57.7%); 2) distilled reasoning models such as DeepSeek-R1-Distill-Qwen-32B (41.3%) fall far behind their teacher model, suggesting that it is difficult to preserve the generalization of reasoning capabilities through distillation alone.
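To make the checklist-guided process analysis concrete, here is a minimal sketch of how process-aware scoring can discount lucky guesses. All names (ChecklistItem, score_response, the example checklist) are hypothetical illustrations under the abstract's description, not the authors' actual pipeline.

```python
# Sketch: credit an answer only in proportion to the required reasoning
# steps it demonstrably performed, so a correct final answer with no
# supporting process scores low. Hypothetical, not the DocPuzzle code.
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    description: str   # one required reasoning step, written by annotators
    satisfied: bool    # judged from the model's reasoning trace

def score_response(final_answer_correct: bool,
                   checklist: list[ChecklistItem]) -> float:
    """Combine answer correctness with checklist coverage."""
    if not checklist:
        return 1.0 if final_answer_correct else 0.0
    process_score = sum(item.satisfied for item in checklist) / len(checklist)
    # A correct answer is only rewarded to the extent the process supports it.
    return process_score if final_answer_correct else 0.0

# Example: a correct answer backed by 2 of 4 required steps earns 0.5,
# not full credit, which is one way guessing bias can be mitigated.
items = [ChecklistItem("locates the relevant clause in the document", True),
         ChecklistItem("resolves the cross-reference to the appendix", True),
         ChecklistItem("computes the intermediate quantity", False),
         ChecklistItem("verifies the boundary condition", False)]
print(score_response(final_answer_correct=True, checklist=items))  # 0.5
```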
Robust Learning via Conditional Prevalence Adjustment
Nguyen, Minh, Wang, Alan Q., Kim, Heejong, Sabuncu, Mert R.
Healthcare data often come from multiple sites, across which the correlations between confounding variables can vary widely. If deep learning models exploit these unstable correlations, they may fail catastrophically in unseen sites. Although many methods have been proposed to tackle unstable correlations, each has its limitations. For example, adversarial training forces models to completely ignore unstable correlations, but doing so may lead to poor predictive performance. Other methods (e.g. invariant risk minimization [4]) try to learn domain-invariant representations that rely only on stable associations by assuming a causal data-generating process (input X causes class label Y). Thus, they may be ineffective for anti-causal tasks (Y causes X), which are common in computer vision. We propose a method called CoPA (Conditional Prevalence-Adjustment) for anti-causal tasks. CoPA assumes that (1) the generation mechanism is stable, i.e. label Y and confounding variable(s) Z generate X, and (2) the unstable conditional prevalence in each site E fully accounts for the unstable correlations between X and Y. Our crucial observation is that confounding variables are routinely recorded in healthcare settings, so the prevalence can be readily estimated, for example, from a set of (Y, Z) samples (no need for corresponding samples of X). CoPA can work even if there is a single training site, a scenario that is often overlooked by existing methods. Our experiments on synthetic and real data show that CoPA outperforms competitive baselines.
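The core idea lends itself to a short sketch. Under the stable generation assumption that p(x | y, z) is shared across sites, Bayes' rule gives p(y | x, z, e) proportional to p(x | y, z) p(y | z, e), so a classifier trained at one site can be re-targeted to a new site by swapping in that site's conditional prevalence p(y | z, e), estimated from (Y, Z) pairs alone. The function and variable names below are illustrative under that reading of the abstract, not the authors' implementation.

```python
# Sketch of conditional prevalence adjustment in the spirit of CoPA.
# Hypothetical helper names; only (y, z) pairs from the target site are
# needed to estimate its prevalence, no inputs X.
import numpy as np

def estimate_prevalence(y: np.ndarray, z: np.ndarray, n_classes: int,
                        n_groups: int, alpha: float = 1.0) -> np.ndarray:
    """Estimate p(y | z) from labeled (y, z) pairs with Laplace smoothing."""
    counts = np.full((n_groups, n_classes), alpha)
    np.add.at(counts, (z, y), 1.0)                 # tally (z, y) co-occurrences
    return counts / counts.sum(axis=1, keepdims=True)

def adjust_logits(logits: np.ndarray, z: np.ndarray,
                  train_prev: np.ndarray, target_prev: np.ndarray) -> np.ndarray:
    """Shift each example's logits by the log-ratio of target-site to
    training-site conditional prevalence for its confounder group z."""
    return logits + np.log(target_prev[z]) - np.log(train_prev[z])
```

Because the adjustment only rescales class scores per confounder group, it works even with a single training site: the training prevalence comes from that one site, and the target prevalence from the deployment site's (Y, Z) records.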
Minecraft Used as Training Site for New AI
Minecraft is one of the most popular games of our time. Known for its block-based aesthetics and nearly limitless exploration, the game lets players do almost whatever they want. Because of these endless possibilities, Minecraft has become central to the work of an artificial intelligence (AI) research group. As reported by MIT Technology Review, researchers at Facebook have been using Minecraft as the training system for their new AI. While most contemporary successful AI systems are designed to excel at a single task, these researchers aim to create an AI that can do many things well as it assists a player in their Minecraft adventure.